Skip to content

Reuse bucket when reduce terms buckets#20483

Draft
bowenlan-amzn wants to merge 2 commits into
opensearch-project:mainfrom
bowenlan-amzn:reuse-bucket
Draft

Reuse bucket when reduce terms buckets#20483
bowenlan-amzn wants to merge 2 commits into
opensearch-project:mainfrom
bowenlan-amzn:reuse-bucket

Conversation

@bowenlan-amzn
Copy link
Copy Markdown
Member

@bowenlan-amzn bowenlan-amzn commented Jan 26, 2026

Description

When the requested field's cardinality is high, the reduce of term aggregation bucket creation become heavy.
Every cardinality is represented by a bucket. For each reduce, we merge the bucket of same key from multiple responses together, and create a new bucket for that.

This PR tries to reuse the existing bucket from the 1st response by in-place updating the field values, instead of creating one.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jan 26, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for ae8b2ac: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for ef07406: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for be081cb: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.26%. Comparing base (d26ad17) to head (1c6961e).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20483      +/-   ##
============================================
- Coverage     73.35%   73.26%   -0.10%     
+ Complexity    73209    73169      -40     
============================================
  Files          5921     5932      +11     
  Lines        333798   333941     +143     
  Branches      48124    48128       +4     
============================================
- Hits         244862   244650     -212     
- Misses        69387    69713     +326     
- Partials      19549    19578      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot Bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Feb 28, 2026
@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot Bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Apr 1, 2026
Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 5, 2026

PR Reviewer Guide 🔍

(Review updated until commit 1c6961e)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Reuse bucket in-place during terms aggregation reduce

Relevant files:

  • server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java

Sub-PR theme: Update tests to handle in-place bucket mutation during reduce

Relevant files:

  • test/framework/src/main/java/org/opensearch/test/InternalAggregationTestCase.java
  • server/src/test/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregatorTests.java

⚡ Recommended focus areas for review

Mutation Side Effect

The in-place mutation of firstBucket (modifying docCount, aggregations, and docCountError) could cause issues if the same bucket object is referenced elsewhere (e.g., cached, shared across reduce calls, or used in partial reduces). Since buckets.getFirst() returns the actual bucket from the input list, mutating it may corrupt the original shard-level response data if it is reused or referenced after the reduce.

B firstBucket = buckets.getFirst();
if (firstBucket instanceof Bucket<?> mutableBucket) {
    mutableBucket.docCount = docCount;
    mutableBucket.aggregations = reducedSubAggs;
    mutableBucket.docCountError = docCountError;
    return firstBucket;
}
Partial Reduce Safety

During partial reduces (not just final reduce), reduceBucket is also called. Mutating the first bucket in-place during a partial reduce could corrupt intermediate results if those buckets are referenced by other data structures or if the partial reduce result is later used as input to another reduce step.

// In-place mutation: reuse first bucket instead of allocating new one
// Only applies to InternalTerms.Bucket subclasses (LongTerms, StringTerms, etc.)
// InternalMultiTerms.Bucket has a different hierarchy and uses createBucket()
B firstBucket = buckets.getFirst();
if (firstBucket instanceof Bucket<?> mutableBucket) {
    mutableBucket.docCount = docCount;
    mutableBucket.aggregations = reducedSubAggs;
    mutableBucket.docCountError = docCountError;
    return firstBucket;
}
return createBucket(docCount, reducedSubAggs, docCountError, firstBucket);
Verification Ordering

inputsForVerification is sorted after copying, but toReduce (which is built from inputs before sorting) is not re-sorted. The inputs list is sorted in-place, but toReduce was already populated before the sort. This means toReduce may have a different order than inputs, potentially causing inconsistencies in the reduce result compared to what inputsForVerification expects.

List<InternalAggregation> toReduce = new ArrayList<>();
toReduce.addAll(inputs);
// Sort aggs so that unmapped come last. This mimicks the behavior of InternalAggregations.reduce()
inputs.sort(INTERNAL_AGG_COMPARATOR);
inputsForVerification.sort(INTERNAL_AGG_COMPARATOR);
Reduced Test Coverage

The test was changed from iterating over all aggregations as the reduce leader to only using aggs.get(0). The original loop verified that reduce produces consistent results regardless of which aggregation leads the reduce. Removing this loop reduces test coverage for correctness across different reduce orderings.

// Call reduce once (as in production) - the first aggregation leads the reduce
InternalAggregation mergedAggs = aggs.get(0).reduce(aggs, ctx);
assertTrue(mergedAggs instanceof DoubleTerms);
long expected = numLongs + numDoubles;
List<? extends Terms.Bucket> buckets = ((DoubleTerms) mergedAggs).getBuckets();
assertEquals(4, buckets.size());
assertEquals("1.0", buckets.get(0).getKeyAsString());
assertEquals(expected, buckets.get(0).getDocCount());
assertEquals("10.0", buckets.get(1).getKeyAsString());
assertEquals(expected * 2, buckets.get(1).getDocCount());
assertEquals("100.0", buckets.get(2).getKeyAsString());
assertEquals(expected * 2, buckets.get(2).getDocCount());
assertEquals("1000.0", buckets.get(3).getKeyAsString());
assertEquals(expected, buckets.get(3).getDocCount());

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 5, 2026

PR Code Suggestions ✨

Latest suggestions up to 1c6961e
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix sort order mismatch between inputs and toReduce

inputs is sorted after toReduce is populated from it, but toReduce itself is not
re-sorted. This means toReduce retains the original unsorted order while inputs is
sorted, potentially causing a mismatch between what is reduced and what is verified.
The sort of inputs should happen before toReduce.addAll(inputs), or toReduce should
also be sorted.

test/framework/src/main/java/org/opensearch/test/InternalAggregationTestCase.java [378-379]

+// Sort aggs so that unmapped come last. This mimicks the behavior of InternalAggregations.reduce()
 inputs.sort(INTERNAL_AGG_COMPARATOR);
 inputsForVerification.sort(INTERNAL_AGG_COMPARATOR);
+List<InternalAggregation> toReduce = new ArrayList<>();
+toReduce.addAll(inputs);
Suggestion importance[1-10]: 6

__

Why: This is a valid observation - toReduce is populated from inputs before inputs is sorted, so toReduce has a different order than inputs. However, looking at the existing code (pre-PR), this was already the case, so this is a pre-existing issue not introduced by this PR. The suggestion's improved_code correctly shows reordering the sort before toReduce.addAll(inputs).

Low
Warn about unsafe in-place mutation of shared bucket

The in-place mutation of firstBucket is unsafe if the same bucket object is
referenced elsewhere (e.g., in the original shard results list that was passed in).
If buckets.get(0) is still referenced by the caller or stored in another data
structure, mutating it will corrupt those references. Consider whether the bucket
list is exclusively owned at this point, or add a defensive copy/clone before
mutating.

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java [563-570]

 B firstBucket = buckets.getFirst();
 if (firstBucket instanceof Bucket<?> mutableBucket) {
+    // Only mutate if this bucket is not shared/referenced elsewhere
+    // If buckets list is exclusively owned here, mutation is safe
     mutableBucket.docCount = docCount;
     mutableBucket.aggregations = reducedSubAggs;
     mutableBucket.docCountError = docCountError;
     return firstBucket;
 }
 return createBucket(docCount, reducedSubAggs, docCountError, firstBucket);
Suggestion importance[1-10]: 3

__

Why: The suggestion raises a valid concern about in-place mutation, but the improved_code is essentially identical to the existing_code with only a comment added. The PR's test changes (deep copying inputs before reduction) suggest the authors are aware of mutation semantics. The suggestion doesn't provide a concrete fix, just a comment.

Low

Previous suggestions

Suggestions up to commit a55ceb9
CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure reduce list matches sorted input order

inputs is sorted after toReduce is populated from it, but toReduce itself is not
re-sorted. This means toReduce retains the original unsorted order while inputs is
sorted, potentially causing a mismatch between what is reduced and what is verified.
The sort of inputs should happen before toReduce.addAll(inputs), or toReduce should
be re-sorted as well.

test/framework/src/main/java/org/opensearch/test/InternalAggregationTestCase.java [378-379]

 inputs.sort(INTERNAL_AGG_COMPARATOR);
 inputsForVerification.sort(INTERNAL_AGG_COMPARATOR);
+toReduce.clear();
+toReduce.addAll(inputs);
Suggestion importance[1-10]: 5

__

Why: This is a valid observation - toReduce is populated before inputs is sorted, so toReduce retains the original unsorted order. However, looking at the existing code, this was already the case before this PR (the sort happened after toReduce.addAll(inputs)), so this is a pre-existing issue rather than something introduced by this PR.

Low
Avoid mutating shared bucket references unsafely

The in-place mutation of firstBucket is unsafe if the same bucket object is
referenced elsewhere (e.g., in the original shard results list that was passed in).
Since buckets.get(0) is the actual object from the input list, mutating it could
corrupt the original aggregation results if they are referenced after reduction.
Consider only reusing the bucket when it is safe to do so, or document clearly that
callers must not retain references to input buckets after calling reduceBucket.

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/InternalTerms.java [563-570]

 B firstBucket = buckets.getFirst();
-if (firstBucket instanceof Bucket<?> mutableBucket) {
+// Only reuse if this is not the sole bucket (i.e., there are multiple shards contributing),
+// or ensure callers are aware that input buckets may be mutated.
+if (buckets.size() > 1 && firstBucket instanceof Bucket<?> mutableBucket) {
     mutableBucket.docCount = docCount;
     mutableBucket.aggregations = reducedSubAggs;
     mutableBucket.docCountError = docCountError;
     return firstBucket;
 }
 return createBucket(docCount, reducedSubAggs, docCountError, firstBucket);
Suggestion importance[1-10]: 4

__

Why: The concern about mutating firstBucket is valid in principle, but the PR already handles this by copying inputs before reduction in the test framework. The suggested fix of checking buckets.size() > 1 doesn't actually solve the aliasing problem (a single-shard result could still be referenced elsewhere), making the improved code logically inconsistent with the stated concern.

Low

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 5, 2026

❌ Gradle check result for a55ceb9: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 6, 2026

Persistent review updated to latest commit 1c6961e

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 6, 2026

✅ Gradle check result for 1c6961e: SUCCESS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant